Pivoting: it’s a verb from our every day life used to describe the movement of an object or a person. Well datasets can pivot too ! Today we are going to see why pivoting them is tool for efficient and high-level.
You will understand what a wide data format is, and what a long data format is
You will know how to pivot long data to wide data using
pivot_long()
You will know how to pivot wide data to long data using
pivot_wider()
You have the intuition why the long data format is easier for plotting and wrangling
We will see many different ones. So we will introduce them as we go along. But here is a brief overview:
Worldwide, number of infant deaths (under 1 years old), per country, per year
Worldwide, number of births, per country, per year
European-scale, number of births, per country, per year
Worldwide, number of malaria cases, per country, per year
Worldwide, number of HIV cases, per country, per year
So many datasets! Now let’s learn about their formats:
“Wide” data means that each data entry is a unique observation unit. This observation is the subject of further measurements or information collection. A unique observation unit can be a person, a country, a laboratory sample etc.
“Long” data means that each data entry is a data collection event. If the same observation unit had data collected in three ways, then it will have three data entries (3 rows).
Based on this explanation, “wide” and “long” formats are the same dataset if data collection only occurred once for each unique observation unit.
Let’s see an example of “wide” vs “long” data format for the same dataset. Imagine we are handling a timeseries of patients whose blood pressure you have recorded everyday.
You can record the data in a “wide” format like this:
Fig: “wide” dataset for a timeseries of patients.
Each unique observation unit (each patient) has only one row. The events (different measuring days) occupy different columns. You could say that the focus is on the unique observations(the patients).
Or you could record the data in a “long” format as so :
Fig: “long” dataset for a timeseries of patients.
Here the unique days of recording (3 different events) define the data entries (the rows) of the data set. There are multiple rows per patient (the units of observation). Here the focus is on the time points / the events, the days.
Of course, the different collection ways do not have to be events / time points. They can be localizations (different neighborhoods, different cities) or any other variables which can stratify (divide into subgroups) the main units of observations.
Fig: “wide” dataset where the unique observation unit is a lab sample and the different collection ways are different sampling sites.
Fig: “long” dataset where the unique observation unit is a lab sample and the different collection ways are different sampling sites.
Likewise, the unit of observation does not have to a patient. It can be a country (as presented here), or any other entity.
Let’s load the number of Nipah cases in different regions of malaysia.
nipah <- outbreaks::nipah_malaysia %>% as_tibble()
nipahIs this a wide or long data format?
# Q_nipah_type <- "_____"
# What is your unique observational unit ?
# Q_nipah_observations <- "_____"
# The truth is: it really depends on what you want to do ! The wide format is great for displaying data because it’s easy to visually compare values of unique observations unit this way. Long data is best for a majority of data analysis tasks such as grouping, plotting.
It will be essential for you to know how to switch from one to the other easily.
Switching from the “wide” to the “long” format (or the other way around) is called pivoting.
Many datasets you’ll find in online data hubs and services will come in a wide format. The reason for this is that the wide format is a common way of inputing/entering data into tables during data collection.
Let’s have a look at Gapminder’s open source data about the Number of infant deaths (under 1 years old). Check out Gapminder: an awesome resource !
And here is how we went to find this dataset:
Fig: Navigating open-source datasets on Gapminder. Here we searched for “Number of infant deaths.”
Let’s read in the CSV we downloaded from Gapminder. We are going to select the years between 2010 and 2015 to keep it simple.
infant_deaths_wide <-
read_csv(here("data/gapminder_infant_deaths.csv")) %>%
select(country, x2010:x2015)## Rows: 209 Columns: 7
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (6): x2010, x2011, x2012, x2013, x2014, x2015
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
infant_deaths_wideFollowing our example above, we observe that each row corresponds to a unique observation unit: a country. The different events of data collection per country are different years: they are spread across different columns. Hence this dataset is in a “wide” format.
The numbers in each box of country x year dataset are the counts of infant deaths for that country and for that year. As an example, in 2010, there were 74600 infant deaths recorded for Afghanistan.
If we wanted to convert the data format to “long” format then we can
use a convenient function pivot_longer :
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015)Within pivot_longer we define, using the
cols argument, which columns we want to pivot.
We observe that the “long” format dataset now has each country
occupying 5 rows (one per year between 2010 and 2015). The years have
been pivoted into a single variable (names) instead of each
representing their own variable (x2010:x2015). All the
numbers in the country x years dataset were pivoted into their own
variable (values). The infant deaths used to be in matrix
format (2D: 2 dimensions), now they are in vector format (1D: 1
dimension).
This long dataset will be much more handy for most data analysis functions. Intuitively, manipulating 1D is always easier than manipulating 2D. More on this later !
However, as good data analysts, you might be cringing at the new
names of your variables: names and values are
generic and do not represent our data at all ! No worries, you can give
custom column names:
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to = "deaths_count")The cols argument defining the variables chosen for
pivot are the same. However names_to allows to define the
variable name of the 1-D vector now regrouping your multiple pivoted
columns. As you are pivoting years, it makes sense to call their
variable year. We apply the same reasoning for the number
of infant deaths pivoted from a matrix to a sole column: we name the
variable embodying this column deaths_counts.
One could argument that the “long” format is more informative than
the “wide” format. Why? Because of these column names. In the “wide”
format, unless the CSV is named count_infant_deaths (or
something alike) or unless someone tells you “these are the counts of
infant deaths per country and per year”, you have no idea what those
numbers represent.
You may be a bit frustrated by the weird x in front of
your years. Here is how you would remove it using
parse_number() function:
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to = "deaths_count") %>%
mutate(year = parse_number(year))We will see more on this function in the lessons dedicated to strings. But as a brief explanation: it extracts numbers from strings.
For later, let’s store this “long” format dataset of the count of infant deaths per country and per year:
infant_deaths_long <-
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to = "deaths_count")For this practice question, we will use the
euro_births_wide dataset from Eurostat.
It contains data on births in 50 European countries: the crude birth
rate per thousand people, collected on an annual basis.
euro_births_wide <-
read_csv(here("data/euro_births_wide.csv"))## Rows: 50 Columns: 8
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (7): x2015, x2016, x2017, x2018, x2019, x2020, x2021
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
euro_births_wideWhat is the unique observational unit? Your answer should be a string giving the name of the variable.
# Q_euro_births_unique_observational_unit <- "_____"
# What variable defines the data collection in multiple ways (i.e. the potential stratification variable) ? Your answer should be a string giving the name of the variable.
# Q_euro_births_multiple_ways <- "_____"
# The data is in a “wide” format. Convert it to a “long” format.
# Q_euro_births_long_format <-
# euro_births_wide %>%
# pivot_longer(___________)Wide data often comes from external sources as we have seen above. Long data, on the other hand, is likely to be created by you in the course of your data wrangling. Let’s see an example:
We will use yet another dataset: patient records from an Ebola outbreak in Sierra Leone in 2014.
ebola <- outbreaks::ebola_sierraleone_2014 %>% as_tibble()
ebolaHere each row corresponds to a unique unit of observation: one ebola patients. The data is in a “wide” format.
You may want to count how many cases there were in each year across
districts. For this, you would apply the summary()
functions on your datasets:
ebola_summary <-
ebola %>%
summary()
ebola_summaryA summary is considered a long format because it no longer has the structure one row = one observation unit.
Let’s see an example where you keep a wide format, but you can the unique observation unit. This occurs when we count the number of patients in each district.
ebola_district_counts <-
ebola %>%
count(district)
ebola_district_countsNow, one row = one observation, so we are still in a wide format, but the unique observation units are now the districts and not the patients themselves.
Let’s now explore another, somewhat similar, data wrangling manipulation which will change your data format from “wide” to “long”: grouping by multiple variables then counting number of recorded patients within groups.
ebola_yearly_counts <-
ebola %>%
mutate(year = lubridate::year(date_of_onset)) %>%
group_by(district, year) %>%
count()
ebola_yearly_countsHere we start with extracting the year information (year of the date)
from the variable date_of_onset. For this we use a useful
function year() from the package {lubridate}, which you
will see in more details in further lessons. We then group by
district and year to count the number of
patients recorded per district and per year.
Because of this nested grouping, we are generating a long format dataset: we do not have one row = one unique observation. Both districts and years take up multiple rows. Each district takes up 2 rows (because their count is calculated for 2 years: 14 x 2 = 28) and each year takes up 14 rows (because their count is calculated over 14 districts: 2 x 14 = 28).
This is a perfect illustration of how one data wrangling manipulation will preserve a wide format, while another will generate a long format.
What if we wanted to switch ebola_yearly_counts back to
a wide format ? There is an easy function for this:
pivot_wider() (similar to pivot_longer()).
Let’s switch our data back to a wide format:
ebola_yearly_counts %>%
pivot_wider(values_from = n,
names_from = year)pivot_wider() has two important arguments:
values_from and names_from.
values_from defines which values will become the core of
the wide data format (in other words: which 1D vector will become a 2D
matrix). names_from identifies which variable to use to
define column names in the wide format.
To explain with our dataset ebola_yearly_counts. We
define our observation unit as the districts. So to have a wide format,
we need the variable years to become two columns:
2014 and 2015, the unique values of the
variable. We define this in pivot_wider() with
names_from. This would then create a dataframe in the
format districts x years, in other words, a matrix. You need to fill the
matrix with values, hence the argument values_from, which
indicates we want to fill the matrix with count values.
Now, imagine we wanted to see the years as our unique observation
unit. We would do exactly the same, but our 2nd dimension to the wide
format would be the districts:
ebola_yearly_counts %>%
pivot_wider(values_from = n,
names_from = district)Here our unique observation units (our rows) are now the years (2014, 2015). Our 2nd dimension (our columns) are the districts. Our matrix input are the count values organized in the format years x districts.
That’s it! We’ll see more complex examples in the next lesson…
Above, using Gapminder’s data on the number of infant deaths, we
transformed the wide format dataset infant_deaths_wide into
the long format dataset infant_deaths_long. Now, using
pivot_wider(), let’s switch it back to its wide format
again. We want the unique observation unit to be the countries and to
have a countries x years matrix, filled with the death counts.
# Q_infant_death_wide_countries <-
# infant_death_long %>%
# pivot_wider(___________)Using again pivot_wider() and again
infant_death_long, let’s switch it back to its wide format
but this time we want the unique observation unit to be the years and to
have a years x countries matrix, filled with the death counts.
# Q_infant_death_wide_years <-
# infant_death_long %>%
# pivot_wider(___________)Above we mentioned that long data is best for a majority of data analysis tasks. Now we can justify why.
First, let’s talk about filtering grouped data. Some filter operations are difficult to do on wide data.
Here is an example taking the infant deaths dataset. We want to answer the following question: For each country, which year had the highest number of child deaths?
This is how we would do so with the long format of the data :
infant_deaths_long %>%
group_by(country) %>%
filter(deaths_count == max(deaths_count))Easy right? We can easily see that Afghanistan had its maximal infant death count in 2010.
If you wanted to do the same thing with wide data, you would need
some weird functions like rowwise():
infant_deaths_wide %>%
rowwise() %>%
mutate(max_count = max(x2010, x2011, x2012, x2013, x2014, x2015))Okay it works, but we still don’t know which year is attached to that
value in max_count. We would have to take that value and
index it back to its respective year column… what a hassle! There are
solutions to find this but all are very painful. Why make your life
complicated when you can just pivot to long format and use the beauty of
group_by() and filter()?
Here we use a special {dplyr} function: rowwise().
rowwise() is a function which allows further operations to
be applied on the rows rather than on the columns. As you can see, here
mutate() creating a maximum value column, takes the maximum
of each row (not the maximum of each column.
Without rowwise() you would get this :
infant_deaths_wide %>%
mutate(max_count = max(x2010, x2011, x2012, x2013, x2014, x2015))The max count over all columns.
Using the Eurostat dataset, euro_births_wide, wrangle to
have the maximal birth count over all years, per country. (like above,
you should think of pivoting, then using group_by() and
filter())
# Q_euro_births_max <- "_____"
# Like filtering, most summarizing operations will be difficult to do on wide data. If we pause and think why it is so hard to filter or summarize on wide data, we realize that it is because the wide data prevents us from grouping our data. A data format where each row is a unique observation unit is a data format where each group is a unique observation unit. A long format allows to group data.
For example, if you want to ask: For each country, between 2015 and 2021, what was the mean number of infant deaths and the standard deviation (variation) in deaths ?
With long data it is simple:
infant_deaths_long %>%
group_by(country) %>%
summarize(mean_deaths = mean(deaths_count),
sd_deaths = sd(deaths_count))With wide data, on the other hand, finding the mean is doable…
infant_deaths_wide %>%
rowwise() %>%
mutate(mean_deaths = sum(x2010, x2011, x2012,
x2013, x2014, x2015, na.rm = T)/6) But standard deviation would be more difficult.
Using the Eurostat dataset, euro_births_wide, wrangle to
have the mean number and variation (standard deviation) of birth over
all years, per country. (like above, you should think of pivoting, then
using group_by() and summarize())
# Q_euro_births_mean_sd <- "_____"
# Joining related datasets is often easier with long data.
To illustrate this, let us load another dataset: the number of children born each year (also another jewel dataset from Gapminder!).
infant_births_wide <-
read_csv(here("data/gapminder_new_births.csv"))## Rows: 201 Columns: 7
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (6): x2010, x2011, x2012, x2013, x2014, x2015
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
infant_births_wideThe dataset looks very similar to the one about infant deaths, only this time, we are viewing the counts of births.
You may want to join this information on births with the number of deaths. With this joint dataset, we could calculate the infant mortality rate for each year
With long data these operations are easy. First pivot new data to long:
infant_births_long <-
infant_births_wide %>%
pivot_longer(cols = x2010:x2015,
values_to = "births_count",
names_to = "year")
infant_births_longThen join using left_join():
infant_deaths_long %>%
left_join(infant_births_long)## Joining, by = c("country", "year")
The mortality rate is the number of deaths on the number of births.
You can easily calculate the mortality rate for each year using
mutate().
infant_deaths_long %>%
left_join(infant_births_long) %>%
mutate(infant_mortality_rate = deaths_count/births_count,
infant_mortality_rate_per_cent = infant_mortality_rate * 100)## Joining, by = c("country", "year")
Continuing the use of Gapminder, we will load in 2 more datasets that you will then join together.
The first is a count of HIV cases per year and per country.
hiv_counts <-
read_csv(here("data/gapminder_hiv.csv"))## Rows: 149 Columns: 23
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): country, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 20...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hiv_countsThe second is a count of malaria cases per year and per country.
malaria_counts <-
read_csv(here("data/gapminder_malaria.csv"))## Rows: 100 Columns: 18
## ── Column specification ───────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): country, 1990, 1997, 1999, 2000, 2002, 2003, 2004, 2005, 2006
## dbl (8): 1991, 1992, 1993, 1994, 1995, 1996, 1998, 2001
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
malaria_countsWe are going to join these two datasets together after pivoting them into long format.
# Q_malaria_hiv_comparison <-
# hiv_counts_long %>%
# left_join(malaria_counts_long)After joinging them, filter for the country of Vietnam and for the year 2000. In 2000, did Vietnam have more cases of malaria or HIV ? Your answer should be a string (either “malaria” or “HIV”).
# Q_Vietnam_2000 <- "_______"Finally, one of the data analysis tasks that is MOST hindered by wide formats is plotting. You may not yet have any prior knowledge of {ggplot} and how to plot so we will see the figures without going in depth with the code. What you need to remember is: many plots with with ggplot are also only possible with long-format data
Consider again the infant_deaths data
infant_deaths_long. We will plot the number of deaths for
Belgium per year:
infant_deaths_long %>%
filter(country == "Belgium") %>%
ggplot() +
geom_col(aes(x = year, y = deaths_count))The plotting works because we can give the variable year
for the x-axis. In the long format, year is a variable
variable of its own. In the wide format, each year is a column and your
counts are a matrix, not a vector. Plots need vectors of data!
Another plot that would not be possible without a long format:
infant_deaths_long %>%
head(30) %>%
ggplot(aes(x = year, y = deaths_count, group = country, color = country)) +
geom_line() +
geom_point()Once again, the reason is the same, we need to tell the plot what to use as an x-axis and a y-axis and it is necessary to have these variables in their own column (as organized in the long format).
You have now explored different datasets and how they are either in a long or wide format. In the end, it’s just about how you present the information. Sometimes one format will be more convenient, and other times another could be best. Now, you are no longer limited by the format of your data: don’t like it? change it !
The following team members contributed to this lesson: